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Abstract The detectors in mass spectrometers are precise enough to count ion events, 
but in practice large quantization errors affect the observations. To study the statistics 
of low intensity chemical noise, we model the detector signal asX — [tA^J and esti- 
mate both T and in a semi-parametric approach where the integer valued random 
variable represents the number of ions and t represents the gain parameter of the 
detector. When T < 1, we explain why the gain parameter cannot be recovered with- 
out a priori information on A^. When T > 1 however, A^ can be deduced from X and 
a sufficiently precise estimate of T. To perform parametric estimation of T, we first 
study simple estimators which provide useful upper bounds. We then introduce the 
concept of compatible lattices and we derive an optimal estimator that is independent 
of the law of A^. 
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1 Introduction 

1 . 1 Ion Detectors 

Mass spectrometers are instruments that ionize the compounds of a sample, separate 
the ions, then quantify the ions at each mass to charge ratio. The resulting signal is a 
histogram that represents ion intensity as a function of the mass to charge ratio of the 
ions. With sufficient precision in the separation and the mass to charge measurement, 
the components of the sample can be identified and quantified. Mass spectrometers 
are widely used for analysing very diverse mixtures, e.g. detecting explosives for 
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airport security or analysing oil products. See lUl la |9|] for an introduction to mass 
spectrometry in the life sciences. 

We consider detectors similar to microchannel plate detectors that are used in most 
mass spectrometers When an ion hits the detector plate, it produces an analog 
signal that is amplified, quantized, then reported to the cornputer. The level of quanti- 
zation is quite high as there may be only 2^' = 2048 level|^ in some instruments, and 
small signals as well as chemical noise are strongly affected by quantization effects. 

Specific difficulties have appeared with high-throughput analyses of biological ma- 
terial. In particular, biological samples may contain trace amounts of molecules of 
interest. These are difficult to di sting uish from chemical noise which produces pat- 
terns similar to real signals S llOl] . In 2004, suggested Poisson-like behaviour 
for the ion intensity based on a linear relationship between the mean and variance of 
the noise. This linear relationship suggests that the amplification factor of the detector 
may be unaccounted for in the data set. 

To study chemical noise in the experimental data, we interpret the amplification fac- 
tor as an overdispersion parameter in a semi -parametric approach. To study chemical 
noise in the experimental data, we estimate the amplification factor and an unknown 
distribution for the chemical noise in a semi-parametric approach. Let denote the 
number of chemical noise ions that reach the detector, we consider the following 
observation model; 

X= lxN + e\ 

where the noisy signal tA^ + e is truncated before observation. T represents the am- 
plification factor of the detector and e represents electronic noise. In this paper, we 
make the assumption that e = or equivalently that there are only quantization er- 
rors in the measurements. The observation model is associated with the statistical 
structure (N, Q^jPg), with = (t,A^) where T is a positive real number and is a 
probability distribution on N. 

We believe a priori that is Poisson distributed as it models rare events (ion counts). 
Consequently, t can be interpreted as an overdispersion parameter affecting Poisson 
distributed observations. This has been tackled in the framework of double exponen- 
tial families, as presented by Efron in fst). In 101, Antoniadis et al use double expo- 
nential families in a regression model to analyse diffraction spectra. This corresponds 
to estimating the regression function jXj in the model X'j — zNi where A^, is Poisson 
distributed with varying parameter /!,. However, this framework does not explicitly 
take into account quantization errors and thus provides poor parameter estimates as 
we will show in Section 14.21 Moreover, we wish to confirm the Poisson hypothesis 
using non parametric estimation. 

Our approach is to first estimate T given a set of observations of X, then deduce the 
distribution of A^ from the estimate. We show that the estimate is precise enough to 
allow complete disambiguation of the observations. 
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1 .2 Estimation of the Ion Statistics 

With negligible quantization error, the observation model becomes X = xN. Estima- 
tion of T is trivial; all that is required is to observe the event {A^ = 1 } i.e. {X = T x 1 }, 
or the two events {xi — t/} and {x2 = t(/+ 1)} and compute the difference X2 —xi. 
To recover A^, it then suffices to consider X/t. The quantization error may be ne- 
glected when T ^ 1 in the observation model X = [tA^J and the previous estimates 
provide T with a precision on the order of the quantization error 

In the general case, we can recover the samples of from the samples of X when 
the mapping x [txJ is injective. The inverse mapping is y i-^ \y/'^^ ■ We call this 
situation the distinguishible case. It occurs if and only if T > 1 (see proof in the 
Appendix, Prop|6l). In this situation, the semi-parametric approach can be separated 
into parametric estimation of the gain parameter T then non parametric estimation of 
the distribution of A^ from iid samples. 

Example 1 

> data = floor (1.32 * n) 
7o Distinguishible case 

> n 

[1] 1 2 3 4 5 6 7 8 9 10 

> data 

[1] 1 2 3 5 6 7 9 10 11 13 

When T is smaller than 1, the truncation error merges adjacent values of A^. In the 
following example, the events {A^ = 3} and {A^ = 4} cannot be distinguished in the 
data set. This is because the corresponding observation is {X ~ 2} in both cases. 

Example 2 

> data = floor (0.68 * n) 

7, Non distinguishible case 

> n 

[1] 1 2 3 4 5 6 7 8 9 10 

> data 

[1] 0122344566 

In the distinguishible case, it is natural to sort and index the observed values in 
order to determine the mapping x ^ [txJ . This is not sufficient in practice because 
of missing values or outliers which can modify the indexes. 

1.3 Observation Set 

The gain parameter and the law of A^ have separate effects on X. In the distinguishi- 
ble case, the distribution function of X is a transformation of the distribution function 
of A^ by the mapping x ^ [tjcJ . The gain parameter and the truncation error only 
distort the position of each peak, whereas the relative frequencies are unchanged. 
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Consequently, the support .y of the empirical distribution is sufficient information 
for estimating T whereas the empirical frequencies are sufficient information for the 
distribution of A^. 

In the non distinguishible case, the set of observed integers is always N for large 
samples (see Section O. As a consequence, T cannot be estimated based on that set 
alone. A semi-parametric approach is not feasible either For instance, we cannot 
estimate the mean E[A^] but only E[X] = tE[A^]. To separate T and A^, we have to 
provide prior assumptions on the distribution of like a Poisson parametric family. 

In the following, we study properties of the set 5^ of observed integers. This set can 
be constructed from the dataset in ^(nlog(n)) time using sorting for example. The 
algorithmic complexity of the following algorithms is governed by the size of and 
in particular, the maximum integer in 5^ . 

We focus on the distinguishible case, and perform parametric estimation of the gain 
parameter from a random set of integers. As the support of the empirical distribution 
is a sufficient statistic for T, we use the statistical structure [Q. = 2^ , T, Pt , T G ] 1 , +0° [ ) 
where Q. is the power set of N and T is the exhaustive (7-algebra on Q.. 
(Pt, T e ] 1 , +°°[ ) is a parametric family of distributions on Q. that is implicitly gener- 
ated in the following way. For a fixed integer n and fixed but unknown integer-valued 
random variable A^, Pt is the distribution of the random variable ^ which is the set 
of observed integers in an independent identically distributed sample (Xi, . . . ,X„) of 
X ^ [tN\ . 

1 .4 Organization of the paper 

To estimate T in the distinguishible case, we first provide simple estimators for T 
in Section 121 These are later used as a starting point for improved estimators and to 
restrict the search space for T. 

In Section[3l we define the notion of compatible values and provide a few properties 
of the set of compatible values. In particular, the true parameter T is a compatible 
value and is close to the highest compatible value. This leads to an optimal estimator 
that is described in 13. 31 

We show the results of some simulations in section |4] and compare with the Max- 
imum Likelihood Estimator obtained from the Double Poisson Family, an estimator 
based on linear regression and another one based on Fourier transform. 

2 Estimators and Upper Bounds for t 

The results in this section are based on the following idea. Two points in are 
separated by at least [tJ . Consequently, when T is large, then ^ is a sparse set, 
whereas is dense when t is near 1 . For instance, there are consecutive points in 
if and only if T < 2 (see Proposition|7]in the Appendix). 
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A better estimate can be obtained by combining more than 2 consecutive points. 
Let |x,>']] denote the set of integers between x and y. If \x,y\ is a subset of ,S^, then 

T < IH . Consequently, x can be estimated by IH with a precision on 

y—x y—x 

the order of the inverse of the length of the interval . However, this estimator is 

y-x 

strongly affected by missing values in 

Instead of considering all the segments in S^, we propose to use the overall density 
of the set, which is easier to compute algorithmically. Let:^ = [t nj denote the largest 

x+ 1 

integer in S^. Then T < — - — . When n is unknown (because of potential missing 

n 

values), let n denote the number of non zero observed integers i.e. the number of 
elements in J^. Then 

i+1 x+l 

T < < . 

n n 

x+ I 

Consequently, is an estimate of t with precision on the order of I /n (without 

n 

missing values). As it uses the whole data, it is usually more precise than the previous 
bound. We will use this in the rest of the paper to restrict the search space for T. 

Let us compare the previous bounds on an example. Suppose that T = 1.32 and 
^ = {1,2,3,5,6,7,9,10,11,13}. 
As there are consecutive integers in ^ we obtain t < 2 . 
Using the interval |5,7], we obtain T < 1 + 1/2. 
Using the interval [9, 1 1|, we obtain T < 1 + 1/2 as well. 
The density upper bound is T < 14/10. 

Remark We only provided upper bounds in this section because lower bounds can 
only be deduced from the integers that cannot be generated in the model. These are 
difficult to distinguish from missing values, which are integers that can be generated 
in the model, but do not appear in the set ^ of observed integers. 

3 Compatible Values 

The upper bounds that we proposed in the previous section are easy to compute but 
rather poor because they only take into account the proportion of observed integers. 
In this section, we describe an algorithm with higher computational load but which 
can leverage the information in the location of each observed integer in the data set. 

3.1 Lattices of Integers 

In the observation model X = [tN\ where A'^ is integer valued, only specific integers 
can be generated. Given a strictly positive real number t, let us define the set of 



6 



possible values for x as the lattice associated to t, i.e. the infinite set of integers 
,5^t = {x= \ tk\^k <E N}. The set of observed integers ,5^ is also called the empirical 
lattice. 

With infinitely many observations, the parameter T is completely characterized by 
the empirical lattice as the following proposition shows. This justifies that is suf- 
ficient information for estimating T. 

Proposition 1 (Equivalence between lattices and numbers) In the distinguishible 
case, lett\ andti denote two real numbers such thatt\ > 1 andt2 > 1. Then .yt^ = ,^12 
if and only if ti = tj- 

Proof Obviously, if ti ~ t2 then J^,, = Let us prove the converse, i.e. J^,^ — S^ti 
implies ti = t2 or equivalently if ti 7^ t2 then S^f^ ^ S^t^. Suppose that ti < t2- There 
exists n g N such that [tin\ < [t2n\ . Either [f2nj ^ in which case ^ S^t2' or 
[t2n\ — [finij with nj > n. In the latter case, distinguishibility implies that there are 
strictly more elements in HA than in S^t2 where A denotes the set of integers 
lO,Lf2nJl. 

3.2 The Set of Compatible Values 

Proposition [T] is not sufficient for estimating T because in practice we only observe 
a finite set ^ S^-^. Consequently we define the notion of compatible lattices and 
equivalently compatible values. For any positive real f, we say that t is compatible 
with the data if ^ C Likewise, for any two sets A and B, A is compatible with 
B if B C A. Being compatible with the data set is a necessary condition for a valid 
estimator of T. 

The set of values that are compatible with the infinite lattice is adequate for 
estimating T because of the following proposition. 

Proposition 2 t is the largest real number in "^^t)- 

Proof T is a compatible value, we only have to show that it is the largest. 
Let u denote a real number greater than T, and let a denote a positive real number 
such that T <T + a <u. We will prove that u is not compatible with T by constructing 
an element in that cannot be in 

Let a denote a positive integer such that a > ^, and n ~ [Ta\ . 

Suppose that .5^^ C then n belongs to S^n, and there exists a positive integer a' 

such that n ~ Yua'\ . 

a' >a because in the distinguishible case, a and a' correspond to their indices in the 
sets and ^„ and C 

Moreover, as T < m we have [taJ < [ua\ < [ua'l . For all three terms to be equal to 
n in the distinguishible case requires that a = a'. 

Consequently, both T and u lie in the interval ^^[. As a result, |t — m| < ^ which 
contradicts a> ^. 
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The set '^{■yr) has an intricate structure. It contains the positive real numbers smaller 
than 1 and the harmonics {|,^ G N*}, but these are not the only values. For exam- 
ple, 4/3 is compatible with 2 because every even integer can be written as x 4/3J , 
k G N. Indeed, let k be an even integer. Either A: is a multiple of 4, in which case 
k^4i^ [| x3/J,or;t = 4/ + 2= [| x (3/ + 2)J. 

3.3 Estimation with a Finite Lattice 

In practice, the empirical lattice is finite and can contain missing values and outliers. 
We say that an integer is missing from y when it is in the theoretical lattice S^r, 
smaller than x = max S", but not in S^. The set of compatible values with is a finite 
union of intervals and compatible values are never isolated. As the data contains less 
information, the true parameter T is not the supremum of '^(^), but it is still maximal 
in the following sense. 

Proposition 3 The set of compatible values "^^J?^) contains exactly ]0, 1] and inter- 
vals of length at least \/x^ where x — max J/'. In particular, if there are no outliers 
or missing values in ^ then T belongs to the interval [a,b[ such that b = sup'^#(j5^). 

The proof is based on the following two lemmas. 

Lemma 1 The set of compatible values contains exactly ]0, 1] and a finite number of 
intervals of the form [a,b[ of length at least 1/x^ where x = max J5^. 

Proof Let f > 1 denote a compatible value. For each observed value x e there 
exists an integer n such than x — \ tn\ . Consequently, f verifies f € , [■ The inter- 
section of the constraints f G f- , \ for all jc € ^ is an interval f G [—,—[. All values 
f G , ^ [ verify all of the constraints and are thus compatible. The length of this in- 
terval is ^ — ^ which is at least 1 / inini). In the distinguishible case, n\ < max^J^ 
and n2 < max^, which implies that the length is at least 1 /(maxJ^)^. 

Lemma 2 Let t be a positive real number that is compatible with the empirical lat- 

x+1 

tice. Then t < . 

n 

Proof This follows directly from the upper bounds in Section|2l See Proposition|9]in 
the Appendix. 

To complete the proof, it suffices to show that t belongs to the largest interval. 

Proof There are only finitely many intervals of length at least in [0, f^], so 
there exists such an interval [a,b[. 

Let ^ denote the set = {n \ [nrj G ^}, i.e. the set of values for that generate 
.y. T belongs to a certain interval [a',b'[ which is the intersection of the constraints 
T G [-, for all X — [nrj in ,y. We show that b' = b, i.e. no positive real is both 
greater than Z?' and compatible. Let t such that b' < t. For all n G -jy, [nrj < [wf J . As 
t ^ [a',b'[, t breaks at least one of the constraints, that is to say, there is an integer x 
in ,y such that x = [nrj < [nt\ . x is skipped in y, and thus t is not compatible. 
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The previous proposition suggests that it suffices to find the largest compatible in- 
terval to estimate T, and this is our proposed estimator f. More precisely, the set 
'^^) = U^^Jaj,&y[ is a union of J intervals, with {uj) and [bj) increasing se- 
quences, then 

_ a J + bj 



We use the following algorithm to compute f . This also computes the mapping 

X = \_Tn\ ^ n and the precision. 

- compute the set of observed values by sorting the data set and removing multiple 
occurences 

x+ 1 

- compute the upper bound x <B= where n = card5^ 

n 

- find an approximation of the largest compatible value t by testing the compatibil- 

k 

ity of the real numbers tk = B ^ 

- deduce the indexes from t, that is to say for all x € ,5^, find ; such thatx = \ ti\ 

- compute the interval [a, ^[ as the intersection of the constraints t e 
all X in ^ 

- return — - — as an estimator for T 



3.4 Properties of the Estimator 

According to the previous results, the estimator performs well when there are no 
missing values or outliers. Its precision is (b — d) /2 and can be computed inside the 
algorithm. The precision is at least 1 /«, but depending on the value of t it can reach 
a precision on the order of 1 /n^ . In all cases, the precision is better than the density 
bound, and there is a lower bound. 

If there are missing values or outliers, the algorithm may find an interval of com- 
patible values that does not contain t. For example, if the dataset is {0,2,4,6,8}, a 
reasonable estimator would answer 2 and not T = 4/3 with missing values 1 and 5. 
In practice, such cases are rare, and are related to arithmetic properties of the set y. 
However, the largest compatible value is never an erroneous answer to the problem. It 
is a parcimonious answer in the sense that it is the smallest lattice which may explain 
the dataset. 

The estimator is optimal in the sense that the algorithm finds an interval of positive 
real numbers that are all plausible. Given a dataset (xi = [t /i J , . . . ,x„ = [t /„J ) of 
size n, there is an interval of compatible values that can generate (xi, . . . ,x„) from 
the same reaUzation (j'l, . . .,?'„) of l^. Let \aj,b]\ with bj = sup "^=5^), the following 
proposition holds. 



X X- 

I 



,for 



9 



Proposition 4 Given a realization {ii ,...,/„) ofN, all values in [aj,bj[ generate the 
same data set (xi , . . . ,x„), i.e. 

Vf e [aj,b][, V; G [l,nl, = [t ij\ = [f /^J 

Proof As in the proof of Proposition [3] [oy , Z?/ [ is the intersection of the constraints 

Xj = [t ij\ . 

The data set does not contain enough information to distinguish the values m[aj,bj[. 
In particular, even if the realization (/i , . . . , /„) is given, then the values are not distin- 
guishible. Note that if xq is known not to be in S^x, then for all integers /, T > ^^i^ or 
T < y . These inequalities are not informative because they are akeady contained in 

X = [f/J,Vx e 5^. 

The program is quite fast. First because is relies only on the set 5^ which is much 
smaller than the dataset when T is near 1 and is independent identically distributed, 
because repeats of are discarded. As the following proposition shows, with few 
missing values, the density bound is precise and the algorithm is quicker All com- 
patible values can be retrieved by testing Bx^ numbers. 

Proposition 5 // there are no missing values, the largest compatible value is found 

after at most — ~Tx steps. With a small number of missing values k^h, the number 
n 

2 / 1\ 

of steps is on the order ofi x[k-\ — 1 where k = n — card^ is the number of missing 
values. 

Proof The procedure begins at B = ends before | because T > a > i, and pro- 
ceeds in steps of length l/jf. Consequently, there are at most C ~x^{ — | ) steps . 
Let k — h — cardo5^ denote the number of missing values. We make the following 

three approximations: ^ < n, 1 <iandT~ |. Then C = 1) ( ncarfj/ + «(i+i) ) 

which can be approximated by C ~ z^x{k +\). 

Testing for the compatibility of a real t is linear in the size of S', so the whole 
procedure is at most quadratic. The full set of compatible values can be obtained in 
cubic time. 

4 Results and Discussion 

4. 1 Compatible Values Estimator 

Figure [T] illustrates the compatible values estimator on a simulated dataset. The 
dataset {6,6, 11, 5, 3, 5, 2, 6, 5, 13, 2, 7, 7, 7, 6} is obtained from the observation model 
X = [1.32 *A^J where A' is distributed according to a Poisson random variable with 
mean 5.5. It is first reduced to the lattice = {2,3,5,6,7, 11, 13} and is shown at 
the bottom. 
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Fig. 1 Comparison of a few compatible lattices and the dataset. 
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Fig. 2 Comparison of the dataset and a few lattices that are not compatible. 



The vertical axis represents values of T. The set of compatible values is composed 
of several intervals and represented on the left. For each interval, we select one com- 
patible value f and represent the lattice ^t- All reals in the same interval generate the 
same lattice, up to max(^). 



For comparison, Figure|2]displays for several values that are not compatible with 
the data. For example, 5 and 1 1 are in the dataset but not in 2 ■ 



Two sources of variation affect the estimate f. First, the estimator is not perfect 
because the dataset is finite. Second, the data set y is random. Figure [3] shows the 
performance of the estimator with a fixed dataset (N £ |1, 10|) for several values of 
T. The intervals shown correspond to the intervals in ^^5^) that contain the largest 
compatible value. 



We can see that the precision of the estimator varies with T. Only the range [1,2] 
is shown because the precision only depends on the rest T — [tJ modulo 1 . Conse- 
quently, the absolute precision is roughly constant, whereas the relative precision is 

of—]. With small quantization error (t^ 1) the estimation problem is easier. 
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Fig. 3 Length of the maximal interval for several values of T with A? G [1, 10|. 
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Fig. 4 Kernel density estimate of the distribution of the compatible values estimator on a random dataset. 

Figure]?] shows the distribution of f when the dataset ,5^ is the result of 15 samples 
of X = [tA^J where T ~ 1.32 and is distributed according to a Poisson random 
variable with mean 5.5. The distribution of the estimator value is obtained from the 
200 repeats shown in the bottom of the plot thanks to a kernel estimate, even if the 
distribution is a sum of Dirac point masses. The interval shows the interval obtained 
with the (complete) dataset { [1.32 * «J ,« G |1, 13]]}. 

4.2 The Double Poisson Family 

In this section, we briefly recall the results from and deduce an estimator for 
our model. Let gfi{y) = e^^ji^' /y\ denote the distribution function of a Poisson ran- 
dom variable with mean n. The double Poisson distribution with parameters is 
defined as: 

fe,^{y)=c{d,li)d"^{g^{y)Y{g,.{y)y'' 
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Fig. 5 Kernel density estimate of the distribution of the double poisson estimate of T with flooring noise 
(solid line) and without (dotted line). 

where c is a normalization constant. 

Maximum Likelihood Estimation leads to the following estimators. Let {Yi,...,Y„) 
be independent identically distributed random variables with distribution /e^^, then 

1 " 

2ILi/(i';,M) 
where /(Mi,M2) = Aii(log(Aii) -log(Ai2)) - (Mi -^2). 

Let Yg be a random variable with distribution function /e.^, then according to fsf] 
Yg 1^ has approximately the same distribution asX/9 where X is Poisson distributed 
with mean jxO. With Poisson distribution for the ion counts, our observation model 
becomes Y = [tA^J where is Poisson distributed with mean A . Consequently, esti- 
mates for T and A can be deduced from 9 and fi with the following relations: 

, 1 

T = — 



X = fie 

The double Poisson distribution is a correct approximation of the distribution of 
X/9 for large jj., and in that case, f and A are unbiased estimates of T and A. The 
standard deviation of f is Figure |5] shows the distribution of f with flooring 
noise, i.e. in the model Y — [tA^J (solid line) and without flooring noise in the model 
Y = tN (dotted line). The plot was generated with 2000 repeats with data sets of 
size 500. We observe a large standard deviation compared with the compatible values 
estimator, even on a much larger data set. With large values of fj., truncation has 
Umited effect on the estimate. 
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Fig. 6 Kernel density estimate of the distribution of the double poisson estimate of T with flooring noise 
(solid line) and without (dotted line). 

For modeling rare ion count events, we need to study the estimators with small val- 
ues of X and T. In that case, f is strongly biased for both models as shown on Figure|6] 
This implies that the approximation is not suited to this range of parameters, and that 
the flooring noise makes a significant difference there. Figure|6]was generated using 
2000 repeats with data sets of size 500. For comparison, we show the optimal interval 
obtained by the compatible values estimator on the data set { [1.32 * «J ,« € [1,13]}. 



4.3 Fourier Estimator 

From the set tN we can construct the signal f L/teN 5 (x — ik) where 5 denotes 
the Dirac function, that is to say a periodic series of pulses. The period T may thus 
be estimated using Fourier transform. Likewise, we define the estimator \/Zf as the 
maximum of the Fourier transform of the quasi-periodic signal f :t ^ LagN 5 (jc — 

As T can be seen as a quasi-period, our estimation problem is closely linked to the 
"harmonic retrieval problem". Many approaches have been proposed in that domain 
and the main focus is on the estimation of the Power Spectral Density |6l]. However, 
the signal is usually perturbed by additive noise whereas in this paper we consider a 
distortion of the time axis. 



We use the following algorithm: 

- sample the signal / at the points x; = / for the integers / in |0, max(^)] 

- compute the Discrete Fourier Transform 

i+ 1 

- compute an upper bound using Proposition |9]: x <B = 

n 

- find the frequency with highest absolute Fourier coefficient 

- return the corresponding period (inverse of the frequency) 
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Fig. 7 Fourier transform of the quasi-periodic signal, in Fourier space (left) and period space (right). The 
vertical line shows the upper bound from Proposition|9] 



This estimator has a precision that corresponds to the sampling rate in time space 
around the true value. In the Fourier space, the sampling rate is uniform with steps of 
length 1 / max(^) which is equivalent to 1/(t x i). In the time space, as f = 1 //, 
then AP = —Af/f^ and the sampling rate is non uniform. For / = 1 /t we obtain the 
precision of the Fourier estimator as t/x. This suggests that the precision decreases 
with T. However, the signal frequency 1/t is near i/max(^) which is one of the 
sampling points. As a result, in practice, the absolute precision is on the order of 1 /x 
and independent of T. 



Figure|7]shows in the frequency and period space the Fourier transform of the quasi- 
periodic signal obtained from the dataset N ^ 11,10]]. The vertical line corresponds 
to the upper bound from Proposition|9] 

i 

Remark When oversampling by a factor k, i.e. sampUng at the points x, = - 

for the integers / in |0,max(^) x kj, the harmonics of 1 Hz increase in magnitude. 
Therefore it is necessary to weed out the frequencies above 1 Hz in the distinguishible 
case. Moreover, oversampling increases the maximum frequency that can be repre- 
sented in the Fourier space and does not improve the precision of the estimator 



On a random dataset, the Fourier estimator suffers greatly from missing values. 
Figure [S] shows the distribution of Tf with 200 simulations and a dataset of size 15 
where is distributed according to a Poisson random variable with mean 5.5. The 
precision of the estimator is much worse than the compatible values estimator (see 
the plotted interval). The Fourier estimate Zf is compatible with the dataset in only 
about 1% of the simulations. 
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T = 1 
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Fig. 8 Kernel density estimate of the distribution of the Fourier estimator. 




Fig. 9 The linear regression estimator on a dataset without missing values. 

4.4 Linear Regression Estimator 

The observation model X = [tN\ may be written X ~ tN + e where e is an error 
term. Even if e is not Gaussian, Unear regression can yield a reasonable estimate of 
the regression coefficient T as Figure|9]shows. 

We use the following algorithm: 

- compute the empirical lattice {x,} by sorting and removing duplicates in the 
dataset 

- compute the indexes {«,} according to the sorting index 

- fit a regression line of the form x, = an,- + 0.5 

- return a 

Figure |9] shows the linear regression estimator on the dataset ,y — [tJI, 10] J (no 
missing values). For each element in the dataset, if the regression line intersects the 
length 1 interval then the estimate is compatible with the data point. 
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Fig. 10 The linear regression estimator in the case of missing values. The compatible values estimator is 
shown in dotted line. 

Note that the truncation error is not centered. Consequently, we compute the regres- 
sion coefficient in the the model X + 0.5 = TA^+e. For the same reason, the regressors 
are below the regression line. 

The main difficulty in the linear regression is that the values of the regressor variable 
are unknown. In the distinguishible case, it is possible to reconstruct them when 
there are no missing values, i.e. n |0,n]] ~ £^ where n — maxS'. Otherwise, the 
regressors will be shifted and that affects strongly the estimate. Figure[TO]shows such 
a case. The regressors inferred in the linear regression estimator and the regression 
line are shown in solid line. For comparison, the true regressors are displayed in dot- 
ted line. The compatible values estimator finds the true regressors and its regression 
line is shown in dotted line. 



5 Conclusion 

In the observation model X = [tA^J , the parameter T can be reliably estimated inde- 
pendently from A^. This allows the full recovery of the statistics of A^ prior to model- 
ing. The structure of A^ may then be studied at length afterwards. 

The estimator based on compatible values is optimal and reasonably quick. It is 
resistant to missing values in practice, and in the worst case returns an acceptable 
(parcimonious) answer without hypotheses on the law of A^. 

Unfortunately, this estimator only takes into account truncation noise, and yields 
poor results on real data. We are currently pursuing an extension of the model that 
mixes electronic noise and truncation effects. 
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Compared to the other three estimators, the compatible values estimator performs 
much better but also more slowly. The Double Poisson Family is simply not a suitable 
model in our range of parameters, but there is room for improvement for the other 
estimators. For example, the main difficulty in the linear regression is computing 
the indexes. With some knowledge about the law of A', quantile regression could be 
applied. 

The Fourier estimator suggests a strong relationship with the harmonic retrieval 
problem, although the signal is not periodic. Although the truncation error considered 
in this paper is very different from Gaussian errors usually considered in harmonic 
retrieval, some algorithms from that field may make a better compromise between 
speed and precision for the current problem. 
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Appendix 

Proposition 6 The mapping x i-^ [txJ is injective if and only if T > 1. 

Proof If T = 1, the mapping is the identity function. Suppose T > 1 and let ni and n2 
denote two (positive) integers such that «i < n2- Then Tn2 — Tni > T > 1 and [Tn2j > 
[rni J . When T < 1, the mapping is not injective because [t x IJ = [t x OJ = 0. 



Upper Bounds on T 

Proposition 7 (Any two observations) Let x and y be two distinct elements of the 
set y of observed values. Then T < 1 + |x — yj. 

Proof Let / and j be the values of corresponding to x and y i.e. x = [t/J and 
y = \xj\. Then we have the inequalities: x< li < x + \, y < Tj <3'+l, and thus 
T{j ~ i) <y — x+ I. Assuming x < y, we obtain T < '^-^7- <y^x+l. 

Propositions (Observed intervals) Let fx^y} denote the set of integers between x 

andy. If\x,y\ is a subset of y, then T < 1 H . 

y — x 

Proof Using the same notations as in the proof of Proposition!?] T < '^~pr' + 
■i < 1 + -i because in the distinguishible case the number of elements in \x^y\ is 
y — X + 1 = /■-;■+ 1 . 

Proposition 9 (Density Upper Bound) Let x ~ [rnj denote the largest integer in 
i+ 1 

,y. Then T < — - — . When n is unknown (because of potential missing values), let n 
n 

x+1 x+1 

denote the number ofnon zero observed integers. Then T < — - — < . 

h n 



